Abstract


In this paper, we develop a parameter estimation method for factorially
parametrized models such as the Factorial Gaussian Mixture Model and the Factorial
Hidden Markov Model. Our contributions are two-fold. First, we show that the
emission matrix of the standard Factorial Model is unidentifiable even if the true
assignment matrix is known. Second, we address the issue of identifiability by
making a one-component sharing assumption and derive a parameter learning
algorithm for this case. Our approach is based on a dictionary learning problem of
the form X = OR, where the goal is to learn the dictionary O given the data
matrix X. We argue that, due to the specific structure of the activation matrix R in the
shared component factorial mixture model and an incoherence assumption on the
shared component, it is possible to extract the columns of the O matrix without
alternating between the estimation of O and R.


1 Introduction 


In a typical Gaussian Mixture Model (GMM), each data item is associated with a single Gaussian 
mean, which assumes that only a single cause is active for each observation. While a GMM may be 
appropriate for some applications such as clustering, it is not expressive enough for modeling data 
which possess dependency on multiple variables. 

In a factorial representation of hidden state variables, each data item depends on $K > 1$
variables, each of which is chosen according to a separate hidden state variable. In the case
where the state variables are independent, we call this model a Factorial Mixture Model. If there
exists a first order temporal dependence between the state variables, it becomes the well-known
Factorial Hidden Markov Model (F-HMM) [4, 5]. Factorial HMMs have found use in numerous
unsupervised learning applications such as source separation in audio processing [?], de-noising in
speech recognition [?], vision [?], and natural language processing [9].

Although factorial models have been used extensively in practice, parameter learning is mainly
limited to search heuristics such as the variational Expectation Maximization (EM) algorithm and
Markov Chain Monte Carlo (MCMC) methods [?], which iterate between coordinate-wise updates for
each parameter until local convergence. These methods require good initializations and offer no
bound on the number of iterations they need.

In this paper we have two main contributions. We first show that it is impossible to recover the true 
emission matrix of a factorial model even if we have the true assignment matrix. We then present an 
algorithm which finds a global solution under incoherence and one-component sharing assumptions 
for the emission parameters.

1.1 Notation


We use the MATLAB colon notation $A(:, j)$, $A(j, :)$, which respectively picks the $j$'th
column and the $j$'th row of a matrix $A$. We use the subscript notation $x_{1:T}$ to denote
$\{x_1, x_2, \dots, x_T\}$. A probability simplex in $\mathbb{R}^N$ is the set
$\{(p_1, p_2, \dots, p_N) \in \mathbb{R}^N : p_i \geq 0\ \forall i,\ \sum_{i=1}^{N} p_i = 1\}$.

We also refer to the space of column stochastic matrices of size $N \times M$. The indicator
function is denoted by $1(arg)$: if arg is true then the output is 1, otherwise the output is zero. For
a positive integer $N$, let $[N] := \{1, \dots, N\}$. We also use square brackets to concatenate matrices:
let $A \in \mathbb{R}^{L \times N}$, $B \in \mathbb{R}^{L \times M}$, then $[A\ B] \in \mathbb{R}^{L \times (N + M)}$.
The $N$ dimensional indicator vector is denoted by $e_i \in \mathbb{R}^N$, where only the $i$'th entry
is one and the rest is zero. All-zeroes and all-ones vectors of length $N$ are respectively denoted by
$0_N$ and $1_N$. The element-wise multiplication of matrices $A$ and $B$ is denoted by $A \odot B$.
For two vectors $a, b \in \mathbb{R}^N$, we denote the inner product operation with
$\langle a, b \rangle = a^\top b$. The identity matrix in $\mathbb{R}^{N \times N}$ is denoted by $I_N$.

1.2 Definitions and Background 

Gaussian Mixture Model (GMM): In a GMM, observations $x_t \in \mathbb{R}^L$ are generated conditioned
on latent state indicators $r_t \in e_{1:M}$, such that

$x_t = O r_t + \epsilon, \quad \forall t \in [T],$ (1)

where the latent state indicators are i.i.d., $\Pr(r_t = e_i) = \pi_i$, $\forall t \in [T]$,
$O = E[x_t | r_t] \in \mathbb{R}^{L \times M}$ is the emission matrix, and $\epsilon$ is zero-mean
Gaussian noise with covariance matrix $\Sigma \in \mathbb{R}^{L \times L}$. Although $\epsilon$ may
depend on the cluster indicator $r_t$ in the general case, we show it to be fixed in this equation
to make the transition to a factorial model clearer.

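Factorial Gaussian Mixture Model (F-GMM): Different from a GMM, in an F-GMM an observation
is conditioned on a collection of state variables
$R_t = [(r_t^1)^\top, (r_t^2)^\top, \dots, (r_t^K)^\top]^\top$, where $r_t^k \in e_{1:M^{(k)}}$,
$\forall k \in [K]$. Without loss of generality, to keep the notation uncluttered we assume that
$M^{(k)} = M$, $\forall k \in [K]$. An observation $x_t$ is the sum of $K$ vectors chosen by $R_t$:

$x_t = \sum_{k=1}^{K} O^k r_t^k + \epsilon, \quad \forall t \in [T],$ (2)

where $O^k \in \mathbb{R}^{L \times M}$, $\epsilon \sim \mathcal{N}(0, \Sigma)$, and
$\Pr(r_t^k = e_i) = \pi_i^k$, $\forall t \in [T]$. We denote the assignment matrix formed by
$R_{1:T}$ with $R \in \mathbb{R}^{KM \times T}$.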
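
As a minimal sketch of the generative process in Equation (2), the following draws observations from an F-GMM and builds the assignment matrix $R$ explicitly. Uniform mixing weights, spherical noise, and the names `sample_fgmm`, `O_blocks` and `sigma` are illustrative assumptions, not part of the model definition above.

```python
import numpy as np

def sample_fgmm(O_blocks, T, sigma, rng):
    """Draw T observations x_t = sum_k O^k r_t^k + eps from a factorial GMM.

    O_blocks: list of K arrays of shape (L, M), one emission matrix per chain.
    Assumptions for illustration: uniform mixing weights, spherical noise.
    """
    K = len(O_blocks)
    L, M = O_blocks[0].shape
    states = rng.integers(0, M, size=(K, T))       # state index of every chain at every t
    R = np.zeros((K * M, T))                       # stacked one-hot indicators R_t
    for k in range(K):
        R[k * M + states[k], np.arange(T)] = 1.0
    O = np.hstack(O_blocks)                        # O = [O^1, ..., O^K], shape (L, K*M)
    X = O @ R + sigma * rng.standard_normal((L, T))
    return X, R, O

rng = np.random.default_rng(0)
O_blocks = [rng.normal(size=(5, 3)) for _ in range(2)]   # K = 2, L = 5, M = 3
X, R, O = sample_fgmm(O_blocks, T=100, sigma=0.1, rng=rng)
```

Every column of `R` contains exactly one active entry per chain, which is the structure exploited in the identifiability analysis.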

Factorial Hidden Markov Model (F-HMM): The only difference between an F-HMM and an F-GMM
is the dependency structure of the latent state indicators. In an F-HMM, state indicators
are not independent, but have a Markovian dependency such that
$\Pr(r_{t+1}^k = e_i \mid r_t^k = e_j) = A_{i,j}^k$, where $A^k \in \mathbb{R}^{M \times M}$ is the
column stochastic transition matrix of the $k$'th chain. The observation model is exactly
the same as in the F-GMM, and is given by Equation (2).

The proposed learning algorithm in this paper is based on the dictionary learning
problem of the form $X = OR + \epsilon$, where the dictionary matrix (or the emission matrix)
$O = [O^1, O^2, \dots, O^K] \in \mathbb{R}^{L \times KM}$ is composed of concatenations of individual
dictionaries, and the assignment matrix $R \in \mathbb{R}^{KM \times T}$ consists of the $K$-sparse
vectors $R_{1:T}$. Given the data matrix $X \in \mathbb{R}^{L \times T}$, the learning goal is to
estimate the dictionary matrix $O = [O^1, O^2, \dots, O^K]$, up to permutation of the
columns within each block and up to permutation of the blocks.

Background: The naive approach for dictionary learning is based on alternating minimization. The
basic idea is to alternate between the estimation of the dictionary and the assignment matrix until
convergence; examples include [?]. These approaches are only guaranteed to converge
locally.

There exist only a few algorithms in the literature which can estimate the dictionary matrix without an
alternating minimization scheme. In [?], an exact recovery algorithm is proposed. The proposed
algorithm requires the assignment matrix to be sparse and to have a norm preserving property, and
the dictionary matrix to be square. In [?], global algorithms are proposed for learning
latent variable models, which correspond to the cases where the columns of $R$ are 1-sparse indicator
vectors. Consequently, these algorithms cover the GMM case but not F-GMMs and F-HMMs.
More recently, an algorithm based on computing pairwise correlations between observations to find
overlapping components was proposed in [?]. The algorithm requires all of the dictionary elements
to be incoherent from each other, which may be limiting in our case. Our algorithm is similar in
the sense that it uses correlations to extract the components. However, it is based on the specific
correlation structure of the factorial models that we work on.


2 Identifiability 


As stated earlier, the learning goal is to estimate the dictionary matrices
$O^k = [\mu_1^k, \mu_2^k, \dots, \mu_M^k]$, $\forall k \in [K]$, up to permutation of the columns of
each dictionary, and up to permutation of the dictionaries. We assume that the individual emission
matrices have full column rank, $\text{rank}(O^k) = M$. Unfortunately, the emission matrix of a
Gaussian factorial model in its original form in [2, 4, 5] is unidentifiable: even if an oracle gives
the true assignment matrix $R$, there are infinitely many plausible dictionary matrices $O$. We will
show that the assignment matrix $R$ is rank deficient, which will lead us to the conclusion of
unidentifiability.


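Lemma 1. Let $R^c \in \mathbb{R}^{MK \times M^K}$ denote a matrix whose columns consist of all possible
combinations $R_t$ can take (e.g. for the $M = 2$, $K = 2$ case,
$R^c = \begin{bmatrix} e_1 & e_1 & e_2 & e_2 \\ e_1 & e_2 & e_1 & e_2 \end{bmatrix}$).
We conclude that $\text{rank}(R^c) = MK - (K - 1)$.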


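Proof: We will show this by computing the dimensionality of the left null space of $R^c$. Let

$r_m^k(m_1, m_2, \dots, m_k, \dots, m_K) := \begin{cases} 1, & \text{if } m_k = m \\ 0, & \text{otherwise} \end{cases}$

where $k \in [K]$ and $m \in [M]$. This function returns the $((k - 1)M + m)$'th row of the column of
$R^c$ that corresponds to the combination represented by the tuple $(m_1, m_2, \dots, m_k, \dots, m_K)$,
where $m_k \in [M]$. For a vector $a \in \mathbb{R}^{KM}$, $a \in \text{null}((R^c)^\top)$, by definition
$a^\top R^c = 0^\top$. Let us consider the structure of such $a$:

$\sum_{k=1}^{K} \sum_{m=1}^{M} a_m^k r_m^k(m_1, m_2, \dots, m_k, \dots, m_K) = \sum_{k=1}^{K} a_{m_k}^k = 0$ (3)

So, we see that the elements of $a$ that correspond to different $k$'s should sum up to zero.
Furthermore, for a tuple that only differs in the $k$'th element:

$\sum_{k'=1}^{K} \sum_{m=1}^{M} a_m^{k'} r_m^{k'}(m_1, \dots, \hat{m}_k, \dots, m_K) = a_{\hat{m}_k}^k + \sum_{k' \neq k} a_{m_{k'}}^{k'} = 0,$ (4)

where $\hat{m}_k \neq m_k$, and $\forall k \in [K]$. By comparing Equations (3) and (4), we see that
$a_{m_k}^k = a_{\hat{m}_k}^k$, and consequently $a_m^k = a_{m'}^k$, $\forall (m, m') \in [M]$ and
$\forall k \in [K]$. Together with the constraint $\sum_{k=1}^{K} a_{m_k}^k = 0$, we conclude that
$\dim(\text{null}((R^c)^\top)) = K - 1$. Therefore, from the rank-nullity theorem,
$\text{rank}(R^c) = KM - \dim(\text{null}((R^c)^\top)) = KM - (K - 1)$. □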
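
The rank statement of Lemma 1 can be checked numerically for small $M$ and $K$. The sketch below enumerates every combination of one-hot indicators and compares the numerical rank with $MK - (K - 1)$; the helper name `all_assignments` and the column ordering are illustrative choices.

```python
import itertools
import numpy as np

def all_assignments(M, K):
    """Build R^c: each column stacks K one-hot vectors of length M, one column per tuple."""
    cols = []
    for tup in itertools.product(range(M), repeat=K):
        col = np.zeros(K * M)
        for k, m in enumerate(tup):
            col[k * M + m] = 1.0          # chain k is in state m
        cols.append(col)
    return np.stack(cols, axis=1)         # shape (K*M, M**K)

for M, K in [(2, 2), (3, 2), (3, 3), (4, 3)]:
    Rc = all_assignments(M, K)
    # numerical rank next to the value predicted by the lemma
    print(M, K, np.linalg.matrix_rank(Rc), M * K - (K - 1))
```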

Corollary 1. The rank of the assignment matrix $R \in \mathbb{R}^{KM \times T}$ is upper bounded:
$\text{rank}(R) \leq KM - (K - 1)$.

Proof: The columns of the assignment matrix are such that $R_t = R^c e_l$, $l \in [M^K]$. If $R$ happens
to contain all columns of $R^c$, it achieves the rank of $R^c$. In the case where $R$ does not contain
all columns of $R^c$, its rank is at most that of $R^c$. Therefore, $\text{rank}(R) \leq KM - (K - 1)$. □

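Theorem 1. Given an assignment matrix $R \in \mathbb{R}^{KM \times T}$, the emission matrix of a Gaussian
factorial model is not identifiable, meaning there exist $O_1 \neq O_2 \in \mathbb{R}^{L \times KM}$ such that
$\prod_{t=1}^{T} \mathcal{N}(x_t \mid O_1 R_t, \Sigma) = \prod_{t=1}^{T} \mathcal{N}(x_t \mid O_2 R_t, \Sigma)$.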

Proof: We observe that $\mathcal{N}(x_t \mid O_1 R_t, \Sigma) = \mathcal{N}(x_t \mid O_2 R_t, \Sigma)$ if
$(O_1 - O_2) R_t = 0$, $\forall t \in [T]$, which is equivalent to $(O_1 - O_2) R = 0$. Due to
Corollary 1, $\dim(\text{null}(R^\top)) \geq K - 1$. Therefore, for $K > 1$ there exist $O_1 \neq O_2$
such that $(O_1 - O_2) R = 0$. □

We also intuitively see that the model is unidentifiable since there are $KM$ vectors to estimate in $O$ but
we only have $KM - (K - 1)$ linearly independent equations, as Corollary 1 suggests. Making
this observation, we reduce the number of model parameters to $KM - (K - 1)$ by setting a shared
component $\mu_M^k = s$, $\forall k \in [K]$, where $s \in \mathbb{R}^L$.
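
To make the non-identifiability in Theorem 1 concrete, the sketch below constructs two different emission matrices that produce identical noiseless observations for $K = 2$. It adds an arbitrary vector `c` to every column of the first block and subtracts it from every column of the second; the specific sizes, the seed, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, K, T = 4, 3, 2, 50

O1 = rng.normal(size=(L, K * M))                 # some emission matrix [O^1, O^2]
states = rng.integers(0, M, size=(K, T))
R = np.zeros((K * M, T))                         # assignment matrix with one-hot blocks
for k in range(K):
    R[k * M + states[k], np.arange(T)] = 1.0

# a is in the left null space of R: +1 on the first chain, -1 on the second,
# so a @ R_t = 1 - 1 = 0 for every column R_t.
a = np.concatenate([np.ones(M), -np.ones(M)])
c = rng.normal(size=L)                           # arbitrary shift direction
O2 = O1 + np.outer(c, a)                         # O^1 shifted by +c, O^2 shifted by -c

same_means = np.allclose(O1 @ R, O2 @ R)         # identical Gaussian means for all t
same_params = np.allclose(O1, O2)                # yet the parameters differ
print(same_means, same_params)
```

Because the means $O_1 R_t$ and $O_2 R_t$ coincide for every $t$, no amount of data can distinguish the two parameter settings, even with $R$ known.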


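Definition 1. (The Shared Component Factorial Model - SC-FM) The emission matrix of an SC-FM
is of the form $\tilde{O} = [\tilde{O}^1, \tilde{O}^2, \dots, \tilde{O}^K, s]$, where
$\tilde{O}^k \in \mathbb{R}^{L \times (M - 1)}$ and $s \in \mathbb{R}^L$ is the shared component. The
latent state indicators are either an indicator vector or an all-zeros vector:
$\tilde{r}_t^k \in (0_{M-1} \cup e_{1:M-1})$. The columns of the assignment matrix $\tilde{R}$ are of the form

$\tilde{R}_t = [(\tilde{r}_t^1)^\top, (\tilde{r}_t^2)^\top, \dots, (\tilde{r}_t^K)^\top, K - \sum_{k=1}^{K} \sum_{m=1}^{M-1} 1(\tilde{r}_t^k = e_m)]^\top.$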


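Lemma 2. Let $\tilde{R}^c \in \mathbb{R}^{(K(M-1)+1) \times M^K}$ denote a matrix whose columns consist of
all possible combinations $\tilde{R}_t$ can take (e.g. for the $M = 3$, $K = 2$ case,

$\tilde{R}^c = \begin{bmatrix} e_1 & e_1 & e_2 & e_2 & e_1 & e_2 & 0_2 & 0_2 & 0_2 \\ e_1 & e_2 & e_1 & e_2 & 0_2 & 0_2 & e_1 & e_2 & 0_2 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 2 \end{bmatrix}$).

We conclude that $\text{rank}(\tilde{R}^c) = KM - (K - 1)$, and consequently
$\text{rank}(\tilde{R}) \leq KM - (K - 1)$.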


Proof: We will prove this by showing that the left null space of $\tilde{R}^c$ only contains the all-zeroes vector.
Let

$\tilde{r}_m^k(m_1, m_2, \dots, m_k, \dots, m_K) := \begin{cases} 1, & \text{if } m_k = m \\ 0, & \text{otherwise} \end{cases}$

and $q(m_1, m_2, \dots, m_k, \dots, m_K) := K - \sum_{k=1}^{K} 1(m_k \neq 0)$, for $k \in [K]$,
$m \in [M - 1]$ and $m_k \in 0 \cup [M - 1]$. The first function represents the first $(M - 1)K$ rows,
and the second function represents the last row of $\tilde{R}^c$, for the column that corresponds to the
tuple $(m_1, m_2, \dots, m_k, \dots, m_K)$. For a vector $a \in \mathbb{R}^{K(M-1)+1}$ in the left null
space of $\tilde{R}^c$, $a^\top \tilde{R}^c = 0_{M^K}^\top$. Let us evaluate
$\sum_{k,m} a_m^k \tilde{r}_m^k(m_1, m_2, \dots, m_K) + a_s q(m_1, m_2, \dots, m_K)$ for the tuple $(0, \dots, 0)$:

$\sum_{k=1}^{K} \sum_{m=1}^{M-1} a_m^k \tilde{r}_m^k(0, \dots, 0) + a_s q(0, \dots, 0) = K a_s = 0$ (5)

So we conclude that $a_s = 0$. Next, we do the evaluation for the tuple $(0, \dots, m_k, \dots, 0)$, where
only one element is not equal to zero:

$\sum_{k=1}^{K} \sum_{m=1}^{M-1} a_m^k \tilde{r}_m^k(0, \dots, m_k, \dots, 0) + a_s q(0, \dots, m_k, \dots, 0) = a_{m_k}^k + (K - 1) a_s.$ (6)

Since the expression in (6) must also vanish, comparing Equations (5) and (6) shows that $a_m^k = 0$,
$\forall k \in [K]$ and $\forall m \in [M - 1]$. So, we conclude that
$\dim(\text{null}((\tilde{R}^c)^\top)) = 0$, and therefore from the rank-nullity theorem,
$\text{rank}(\tilde{R}^c) = KM - (K - 1)$. And, if $\tilde{R}$ contains all columns of $\tilde{R}^c$ it has
the same rank, which is the upper limit.
□
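
The same kind of numerical check applies to the shared component model. The sketch below builds the SC-FM assignment matrix, where each chain contributes either a one-hot vector of length $M - 1$ or an all-zero block and the last row counts how often $s$ is used, and confirms that it has full row rank. The helper name `shared_assignments` and the column ordering are illustrative choices.

```python
import itertools
import numpy as np

def shared_assignments(M, K):
    """R^c for the SC-FM: K blocks of length M-1 (one-hot or all-zero) plus a count row."""
    cols = []
    for tup in itertools.product(range(M), repeat=K):    # m_k = 0 selects the shared component
        col = np.zeros(K * (M - 1) + 1)
        for k, m in enumerate(tup):
            if m > 0:
                col[k * (M - 1) + (m - 1)] = 1.0
        col[-1] = K - sum(m > 0 for m in tup)            # number of times s appears
        cols.append(col)
    return np.stack(cols, axis=1)                        # shape (K*(M-1)+1, M**K)

for M, K in [(3, 2), (3, 3), (4, 2)]:
    Rc = shared_assignments(M, K)
    # number of rows next to the numerical rank: equal means an empty left null space
    print(Rc.shape[0], np.linalg.matrix_rank(Rc))
```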

Theorem 2. Given an assignment matrix $\tilde{R}$ which contains all columns of $\tilde{R}^c$, the emission matrix
of an SC-FM is identifiable.

Proof: Following the same reasoning as in the proof of Theorem 1, identifiability requires the term
$(O_1 - O_2)\tilde{R}$ to be non-zero for any two different emission matrices $O_1 \neq O_2$.
As we have seen in Lemma 2, $\dim(\text{null}(\tilde{R}^\top)) = 0$ in the case where $\tilde{R}$ contains
all possible assignment vectors. Therefore we conclude that $(O_1 - O_2)\tilde{R} \neq 0$ for $O_1 \neq O_2$, and
consequently the emission matrix of an SC-FM is identifiable, given an assignment matrix $\tilde{R}$. □

This theorem shows that the mapping $\tilde{O} \mapsto X = \tilde{O}\tilde{R}$ is one-to-one. Even so, it
is still not trivial to extract the columns of the emission matrix $\tilde{O}$ from the observed data $X$, simply
because we do not have $\tilde{R}$. However, we know the structure of $\tilde{R}^c$, which contains all possibilities
for the columns of $\tilde{R}$. In the next section we will describe an algorithm which uses this fact.


3 Learning


What we propose for learning is the following: we first calculate an estimate for $X^c$, the matrix of
all possible noiseless observations, with a clustering stage. Naturally, the columns of $X^c$ contain an
arbitrary and unknown permutation, which leads us to the system $X^c \Pi = \tilde{O}\tilde{R}^c$, where
$\Pi \in \mathbb{R}^{M^K \times M^K}$ is a permutation matrix. This system has
a different solution for different $\Pi$ matrices, and therefore we cannot solve this system for the true
emission matrix unless we know $\Pi$. However, by assuming that the shared component $s$ is less
correlated with the non-shared components than the non-shared components are with each other, we
will show that it is possible to extract the components by computing pairwise correlations between
the columns of $X^c$.

To reduce the notation clutter we drop the tildes, although we still refer to the SC-FM parameters, and
we use the regular factorial model notation where the indicator variable $m_k \in [M]$, for $k \in [K]$.
Conforming with that notation we set the last columns of all the emission matrices to be the shared
component, such that $\mu_M^k = s$, $\forall k \in [K]$. E.g., for the $M = 2$, $K = 2$ case,
$O = [\mu_1^1, s, \mu_1^2, s]$.

3.1 Learning the emission matrix from $X^c$

In this section we describe an algorithm which extracts the columns of the emission matrix by
looking at the pairwise correlations of the columns of the $X^c$ matrix. The first step is to find which
column of $X^c$ corresponds to the shared component.

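Definition 2. Let $x_l$ denote the $l$'th column of $X^c$, so
$x_l := X^c(:, l) = \sum_{k=1}^{K} \sum_{m=1}^{M-1} \mu_m^k r_{m,l}^k + \left(\sum_{k=1}^{K} r_{M,l}^k\right) s$,
where $r_{m,l}^k$, $l \in [M^K]$, denotes the $m$'th entry of an indicator vector of length $M$
where only the $m$'th entry is one and the rest is zero, for the $k$'th emission matrix and the $l$'th
possible combination.

Definition 3. Let $v(x_{l'}) : \mathbb{R}^L \to \mathbb{R}^{M^K}$ denote a vector valued function with the
argument $x_{l'}$, such that
$v(x_{l'}) = \omega([\langle x_1, x_{l'} \rangle, \langle x_2, x_{l'} \rangle, \dots, \langle x_l, x_{l'} \rangle, \dots, \langle x_{M^K}, x_{l'} \rangle])$,
where $\omega : [M^K] \to [M^K]$ is an ascending sorting mapping such that
$v_1(x_{l'}) \leq v_2(x_{l'}) \leq \dots \leq v_{M^K}(x_{l'})$, where $v_l(x_{l'})$ is the
$l$'th smallest element of the $v(x_{l'})$ vector.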

Lemma 3. If the incoherence condition on the inner products of the components holds for all
$(k, k', k'') \in [K]$ and $(m, m', m'') \in [M - 1]$, i.e.
for any component the least correlated component is $s$, and $\langle \mu_m^k, s \rangle < \langle s, s \rangle$,
$\forall k \in [K]$, $m \in [M - 1]$, i.e., the shared component $s$ has a non-trivial magnitude (e.g. an
all-zeros vector does not satisfy this condition), then

$Ks = \arg\min_{x_{l'}} \sum_{l=1}^{(M-1)^K} v_l(x_{l'})$, for $M > 2$, $K > 1$. (7)

Proof Sketch: We want to show that, given that the specified incoherence conditions are satisfied,
the sum of the smallest $(M - 1)^K$ terms in $\{\langle x_l, x_{l'} \rangle : l \in [M^K]\}$ is minimized
when we set $x_{l'} = Ks$. In the proof given in the supplemental material, we consider all possibilities
for $x_{l'}$ and conclude that the minimizing possibility is $Ks$.

Lemma 3 suggests that by computing pairwise correlations, it is possible to identify the column
in $X^c$ which corresponds to the $Ks$ component: the summation of the first $(M - 1)^K$ terms in
$v(x_{l'})$ is minimized when we set $x_{l'} = Ks$. Therefore, we compute $v(x_{l'})$ for all columns of
$X^c$, and assign the minimizing column to the term $Ks$. In the $M = 2$ case the argmin of this summation
contains multiple minimizers (including $Ks$), and we suggest a fix for that specific case with an
additional assumption in the supplemental material. Now that we know how to estimate the $Ks$ term,
we next look at the structure of $v(Ks)$ to extract the non-shared components.

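Definition 4. Let $B_{K'} := \{l \in [M^K] : \sum_{k=1}^{K} \sum_{m=1}^{M-1} r_{m,l}^k = K'\}$, i.e. the
indices $l$ for which $s$ appears $K - K'$ times, which corresponds to the terms of the form
$\sum_{k=1}^{K} \sum_{m=1}^{M-1} \mu_m^k r_{m,l}^k + (K - K')s$, $l \in B_{K'}$.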

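Lemma 4. Let $\beta_l^{K'} := \langle \sum_{k=1}^{K} \sum_{m=1}^{M-1} \mu_m^k r_{m,l}^k + (K - K')s, Ks \rangle$,
$l \in B_{K'}$. If $\langle s, \mu_m^k \rangle < \langle s, s \rangle$, $\forall k \in [K]$ and
$\forall m \in [M - 1]$, then for $M^K - (M - 1)K \leq l' \leq M^K - 1$, $v_{l'}(Ks) = \beta_l^1$ for
some $l \in B_1$.

Proof: Let us expand the expression $\beta_l^{K'}$:

$\beta_l^{K'} = K \sum_{k=1}^{K} \sum_{m=1}^{M-1} \langle \mu_m^k, s \rangle r_{m,l}^k + K(K - K') \langle s, s \rangle.$

Since only $K'$ terms are active in the first term, and due to the condition
$\langle s, s \rangle > \langle s, \mu_m^k \rangle$, $\forall k \in [K]$, $\forall m \in [M - 1]$, we see
that the above expression reaches its maximum value when $K' = 0$. By the same token, we conclude that
$\beta_l^1 > \beta_{l'}^{K'}$, $\forall K' > 1$, $l \in B_1$, $l' \in B_{K'}$, since the number of
$\langle s, s \rangle$ terms decreases as $K'$ increases. Therefore, the largest elements of $v(Ks)$ after
$v_{M^K}(Ks)$ correspond to $\beta_l^1$, $l \in B_1$, as suggested by the lemma. □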


We obtained an estimate for $s$ in the previous step, and from Lemma 4 we now know which observed
$x_l$ vectors correspond to the vectors containing $(K - 1)s$ (i.e. the terms corresponding to $B_1$).
We can therefore estimate the non-shared components simply by subtracting $(K - 1)s$ from each
term in $B_1$. The only remaining problem is to group them into the proper emission matrices.


3.1.1 Finding the grouping of the components 

We know from the lemmas above that the $(M - 1)^K$ smallest elements of $v(Ks)$ (which correspond to
$B_K$) are associated with all possible combinations of non-shared components that do not contain any
term involving $s$. To find the groupings for the dictionary elements we solve a linear system of the
form $Y = WH$ for $H$, where the columns of the $W$ matrix are the non-shared components estimated
by subtracting $(K - 1)s$ from the components corresponding to $B_1$, and the columns of $Y$ correspond to all
possible combinations of the non-shared components which correspond to $B_K$. Solving this system
reveals which combinations of the non-shared components corresponding to $B_1$ add up to the
combinations corresponding to $B_K$, which are encoded in $H$. In practice we have observed that
solving the following optimization problem, which enforces sparsity on the columns of $H$, works
well: $\hat{H} = \arg\min_H \|Y - WH\|_F + \sum_t \|H(:, t)\|_1$.


3.1.2 Summary of emission matrix learning 


For a shared component factorial model (HMM or mixture model), given the matrix of all possible
observations $X^c \in \mathbb{R}^{L \times M^K}$, and provided that the columns of the emission matrix satisfy
the incoherence condition of Lemma 3 and $\langle \mu_m^k, s \rangle < \langle s, s \rangle$,
$\forall (k, k', k'') \in [K]$, $k \neq k'$ and $\forall (m, m', m'') \in [M - 1]$, Algorithm 1 finds
the columns of the emission matrix $O$ up to permutation among the
columns of each emission matrix and permutation of the emission matrices.






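Algorithm 1 Emission matrix learning for F-GMM/F-HMM

Input: The clustered data matrix $X^c \in \mathbb{R}^{L \times M^K}$
Output: Estimated emission matrix $\hat{O} \in \mathbb{R}^{L \times KM}$

• Compute the correlation matrix $C_{i,j} = \langle X^c(:, i), X^c(:, j) \rangle$, $\forall i, j \in [M^K]$.

• Let $C^s$ denote the $C$ matrix with the entries of each column sorted in increasing order. Set
$i^* = \arg\min_i \sum_{j=1}^{(M-1)^K} C^s_{j,i}$, $v = C^s(:, i^*)$, and $s = X^c(:, i^*) / K$.

• Find the indices of the $(M - 1)K$ largest elements in $v$ after the largest one (which corresponds to
$Ks$ itself), write the indices in $B_1$. Set $W = X^c(:, B_1) - (K - 1) s 1_{(M-1)K}^\top$.

• Find the indices of the $(M - 1)^K$ smallest elements in $v$, write the indices in $B_K$. Set
$Y = X^c(:, B_K)$.

• Set $\hat{H} = \arg\min_H \|Y - WH\|_F + \sum_t \|H(:, t)\|_1$; group the columns of $W$ according to
$\hat{H}$.

• Output the corresponding estimate $\hat{O}$.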
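
The steps of Algorithm 1 can be sketched end to end on a small noiseless example. This is an illustration under stated assumptions rather than the authors' implementation: plain least squares followed by rounding stands in for the sparsity-penalized problem (adequate here because the data are noiseless and $W$ has full column rank), and the toy components `s` and `mus` are hypothetical values chosen so that the incoherence conditions hold.

```python
import itertools
import numpy as np

def learn_emissions(Xc, M, K):
    """Noiseless sketch of Algorithm 1: returns s_hat, non-shared components W, group labels."""
    n = Xc.shape[1]                                   # M**K columns
    C = Xc.T @ Xc                                     # pairwise correlations
    C_sorted = np.sort(C, axis=0)                     # every column sorted increasingly
    i_star = int(np.argmin(C_sorted[: (M - 1) ** K].sum(axis=0)))
    s_hat = Xc[:, i_star] / K                         # the minimizing column is K * s
    order = np.argsort(C[:, i_star])                  # indices by correlation with K * s
    B1 = order[n - 1 - (M - 1) * K : n - 1]           # largest entries after K * s itself
    BK = order[: (M - 1) ** K]                        # combinations containing no s
    W = Xc[:, B1] - (K - 1) * s_hat[:, None]
    Y = Xc[:, BK]
    # Assumption: least squares + rounding replaces the sparse regression (noiseless case).
    H = np.rint(np.linalg.lstsq(W, Y, rcond=None)[0])
    co_occur = (H @ H.T) > 0                          # components that appear in the same sum
    same = ~co_occur | np.eye(W.shape[1], dtype=bool) # same chain: never appear together
    labels = -np.ones(W.shape[1], dtype=int)
    next_label = 0
    for i in range(W.shape[1]):
        if labels[i] < 0:
            labels[same[i]] = next_label
            next_label += 1
    return s_hat, W, labels

# Hypothetical toy model with M = 3, K = 2, L = 6.
M, K = 3, 2
s = np.array([2.0, 0, 0, 0, 0, 0])
mus = [np.array([[0, 3, 1, 0, 0, 0], [0, 3, 0, 1, 0, 0]], dtype=float).T,
       np.array([[0, 3, 0, 0, 1, 0], [0, 3, 0, 0, 0, 1]], dtype=float).T]
cols = [sum(s if m == 0 else mus[k][:, m - 1] for k, m in enumerate(tup))
        for tup in itertools.product(range(M), repeat=K)]
perm = np.random.default_rng(0).permutation(M ** K)   # unknown column permutation
Xc = np.stack(cols, axis=1)[:, perm]

s_hat, W, labels = learn_emissions(Xc, M, K)
```

The grouping step relies on the observation that two non-shared components from the same emission matrix never appear in the same combination, so components that never co-occur in $\hat{H}$ are assigned to the same group.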


3.2 On Estimating $X^c$

Even though the number of clusters $M^K$ is large, if the data is high dimensional then the initial
clustering step can be done accurately. Let $d_{i,j} := (X^c(:, i) + \epsilon_i) - (X^c(:, j) + \epsilon_j)$,
where $\epsilon_i \sim \mathcal{N}(0, \sigma^2 I_L)$. Notice $d_{i,j}$ is normally distributed such that

$d_{i,j} \sim \mathcal{N}(X^c(:, i) - X^c(:, j), 2\sigma^2 I_L),$

since $\epsilon_i$, $\epsilon_j$ are independent and spherical. Due to the concentration property of Gaussians [18],
the distribution of $\|d_{i,j} - E[d_{i,j}]\|_2^2$ will get concentrated around a thin shell of radius
$\sqrt{2L}\sigma$ such that

$\Pr(|\|d_{i,j} - E[d_{i,j}]\|_2^2 - 2\sigma^2 L| > c\, 2\sigma^2 L) \leq 2\exp(-L c^2 / 24),$ (8)

where $c > 0$ is a constant. This bound means that the magnitude of the noise on the pairwise
distances between the true combinations $X^c$ concentrates around $2\sigma^2 L$ for high dimensional data. Note
that in the case of correlated Gaussians the concentration property still holds around an elliptical
shell [?]. A naive clustering approach such as running a randomly initialized k-means clustering
can still fail, but a carefully crafted clustering algorithm such as [?] will return the true $X^c$ with
high probability, given that the minimum separation $\min_{i \neq j} \|E[d_{i,j}]\|$ is large enough relative
to the noise level and the smallest mixing weight is not too small.
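
The concentration behaviour behind Equation (8) is easy to observe empirically. The sketch below samples pairs of spherical noise vectors and measures the relative deviation of the squared norm of their difference from $2\sigma^2 L$; the noise level, the number of pairs, and the dimensions tried are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_pairs = 0.7, 2000
mean_dev = {}
for L in (10, 100, 1000):
    eps_i = sigma * rng.standard_normal((n_pairs, L))
    eps_j = sigma * rng.standard_normal((n_pairs, L))
    sq_norm = np.sum((eps_i - eps_j) ** 2, axis=1)       # ||d_ij - E[d_ij]||^2 per pair
    rel_dev = np.abs(sq_norm - 2 * sigma**2 * L) / (2 * sigma**2 * L)
    mean_dev[L] = rel_dev.mean()                         # average relative deviation
    print(L, mean_dev[L])
```

The average relative deviation shrinks as $L$ grows, which is why the noise contributes an almost constant offset to pairwise squared distances in high dimensions.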


3.3 Estimating the auxiliary parameters 
Hidden state parameters: 

Once we have an estimate $\hat{O}$ for the emission matrix, the assignment matrix can be estimated by
solving the optimization problem $\hat{R} = \arg\min_R \|\hat{O}R - X\|_F + \sum_t \|R(:, t)\|_1$. We estimate
the assignment probabilities $\pi^k$ for the F-GMM, or the transition matrices $A^k$ for the F-HMM, simply
by counting the occurrences in $\hat{R}$:

$\hat{\pi}_i^k \propto \sum_t 1(\hat{r}_t^k = e_i), \quad \hat{A}_{i,j}^k \propto \sum_t 1(\hat{r}_{t+1}^k = e_i)\, 1(\hat{r}_t^k = e_j), \quad i, j \in [M],\ k \in [K].$

In practice, $\hat{R}$ is noisy and its entries are not binary. We threshold the $\hat{R}$ matrix to make it binary
before the counting step.
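
A minimal sketch of the counting step is given below. Two details are assumptions made for illustration: thresholding is done by taking the largest entry within each chain's block of $\hat{R}$, and the counts are normalized so that each $\hat{\pi}^k$ sums to one and each $\hat{A}^k$ is column stochastic. The function names are hypothetical.

```python
import numpy as np

def to_states(R_hat, K, M):
    """Binarize a noisy assignment matrix: pick the largest entry in each chain's block."""
    T = R_hat.shape[1]
    return R_hat.reshape(K, M, T).argmax(axis=1)          # shape (K, T), entries in 0..M-1

def count_parameters(states, M):
    """Estimate mixing weights and transition matrices by counting occurrences."""
    K, T = states.shape
    pi = np.zeros((K, M))
    A = np.zeros((K, M, M))
    for k in range(K):
        pi[k] = np.bincount(states[k], minlength=M) / T
        # A[k][i, j] counts transitions from state j at time t to state i at time t+1
        np.add.at(A[k], (states[k, 1:], states[k, :-1]), 1.0)
        A[k] /= np.maximum(A[k].sum(axis=0, keepdims=True), 1.0)   # column stochastic
    return pi, A

# One chain (K = 1) with M = 2 states and T = 6 noisy assignment columns.
R_hat = np.array([[0.9, 0.1, 0.8, 0.2, 1.1, 0.0],
                  [0.2, 0.7, 0.1, 0.9, 0.1, 0.8]])
states = to_states(R_hat, K=1, M=2)
pi_hat, A_hat = count_parameters(states, M=2)
```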


Covariance matrix: 

Once we have estimates for the emission and the assignment matrix, we subtract the reconstruction
from the data to make it zero mean. After that the covariance matrix is estimated with the usual
covariance estimator:
$\hat{\Sigma} = \frac{1}{T} \sum_{t=1}^{T} (x_t - (\hat{O}\hat{R})_t)(x_t - (\hat{O}\hat{R})_t)^\top$,
where $(\hat{O}\hat{R})_t$ denotes the reconstruction at time $t$.
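
The covariance step amounts to an outer-product average of the residuals. In the sketch below the $1/T$ normalization and the function name `residual_covariance` are assumptions, and the synthetic check uses the true emission and assignment matrices in place of estimates.

```python
import numpy as np

def residual_covariance(X, O_hat, R_hat):
    """Covariance of the residuals X - O_hat R_hat, with columns as observations."""
    resid = X - O_hat @ R_hat                 # shape (L, T), zero mean by construction
    return resid @ resid.T / X.shape[1]       # assumption: 1/T normalization

rng = np.random.default_rng(0)
L, T = 3, 20000
O_hat = rng.normal(size=(L, 4))
R_hat = np.eye(4)[:, rng.integers(0, 4, size=T)]          # one-hot assignment columns
X = O_hat @ R_hat + 0.5 * rng.standard_normal((L, T))     # noise covariance 0.25 * I
Sigma_hat = residual_covariance(X, O_hat, R_hat)
```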


4 Experiments 

4.1 Synthetic Data 

We conducted experiments with synthetic data generated from the shared component factorial model.
We set $M = 3$ and $K = 2$. The columns of the emission matrix are sampled from a Gaussian with
variance 10. The observation noise variance $\sigma^2$, data dimensionality $L$, and number of observations
$T$ were all varied to compare the behavior of the proposed approach and EM. For the clustering
step in the proposed approach, we applied the algorithm in [?]. For EM, we used 10 restarts with
dictionaries initialized at perturbed versions of the mean of the observed data. We report the result
of the initialization that resulted in the highest likelihood. As error, we report the Euclidean distance
between the estimated dictionary matrix $\hat{O}$ and the true dictionary, after resolving the permutation
ambiguity. Figure 1 shows various comparisons between the two algorithms in terms of accuracy in
recovering the true dictionaries and run time. The parameter setup for the fixed variables is shown
under each figure. We see that the proposed algorithm recovers the dictionaries more accurately than EM
in general. We also see from Figure 1 that the proposed approach is faster, and potentially more
scalable, than EM.

4.2 Digit Data 

In this experiment, we work with digit images from the MNIST dataset. We compare the proposed
dictionary learning approach in Section 3 with an EM algorithm, on images synthetically combined
according to the shared component factorial model, where we set $M = 4$ and $K = 2$. We generate
2000 such images. The images are of size $28 \times 28$. We normalize the pixel values so that they
take on values between 0 and 1. We add spherical Gaussian noise with standard deviation $\sigma = 0.22$
to every generated image. We initialize the columns of the emission matrix in EM with
randomly perturbed versions of the mean of the generated data. We do 10 such random initializations
and pick the initialization with the highest likelihood. In Figure 2 we show noisy versions
of all 16 possible combinations. We also show the reshaped versions of the learned columns of the
dictionaries for the proposed algorithm and EM.

[Figure 1 panels: (a) Error vs $\sigma^2$, $L = 50$, $T = 200$. (b) Error vs $L$, $\sigma^2 = 0.5$, $T = 200$. (c) Error vs $T$, $L = 50$, $\sigma^2 = 0.5$. (d) Run time vs $T$, $L = 50$, $\sigma^2 = 0.5$.]

Figure 1: Various performance measures for the proposed algorithm and EM on synthetic data
averaged over 50 trials.

[Figure 2 panels: (a) All possible combinations for the observations. (b) Dictionary Learning. (c) EM.]

Figure 2: Unmixing of synthetically mixed noisy digit images with SC-FM. Figures (b) and (c) show
the learned emission matrices for the proposed algorithm and EM. A row in Figures (a) and (b)
corresponds to the components corresponding to the same group.

We see that the estimates obtained with the dictionary learning approach are close to the true digits,
whereas EM finds a local solution which deviates from the true digits significantly.


5 Conclusions and Discussion 


In this paper we have shown that the standard factorial model in the literature is not learnable. We
then proposed an exact algorithm for the case where there is a one-column sharing assumption
between the $K$ emission matrices. Although we have focused on the one-component sharing case in this
paper, it is possible to derive algorithms for multiple component sharing cases, under certain
incoherence assumptions, as future work. One other interesting future direction is to derive a learning
procedure which would be able to extract the model parameters with fewer outputs from the
clustering stage: we have shown in Section 2 that the number of linearly independent combinations in
$R^c$ is $MK - (K - 1)$, which is much smaller than the number of all possible combinations $M^K$.
The challenge would be to identify the correspondences between the observed vectors and the actual
combination they are associated with.